41 research outputs found

    Nonnegative principal component analysis for mass spectral serum profiles and biomarker discovery

    Background: As a novel cancer diagnostic paradigm, mass spectroscopic serum proteomic pattern diagnostics was reported to be superior to conventional serologic cancer biomarkers. However, its clinical use is not yet fully validated. An important factor preventing this young technology from becoming a mainstream cancer diagnostic paradigm is that robustly identifying cancer molecular patterns from high-dimensional protein expression data is still a challenge in machine learning and oncology research. As a well-established dimension reduction technique, PCA is widely integrated in pattern recognition analysis to discover cancer molecular patterns. However, its global feature selection mechanism prevents it from capturing local features. This may lead to difficulty in achieving high-performance proteomic pattern discovery, because only features interpreting global data behavior are used to train a learning machine.
    Methods: In this study, we develop a nonnegative principal component analysis algorithm and present a nonnegative principal component analysis based support vector machine algorithm with sparse coding to conduct high-performance proteomic pattern classification. Moreover, we also propose a nonnegative principal component analysis based filter-wrapper biomarker capturing algorithm for mass spectral serum profiles.
    Results: We demonstrate the superiority of the proposed algorithm by comparison with six peer algorithms on four benchmark datasets. Moreover, we illustrate that nonnegative principal component analysis can be effectively used to capture meaningful biomarkers.
    Conclusion: Our analysis suggests that nonnegative principal component analysis effectively conducts local feature selection for mass spectral profiles and contributes to improving sensitivities and specificities in the subsequent classification and to meaningful biomarker discovery.

    Graph similarity through entropic manifold alignment

    In this paper we decouple the problem of measuring graph similarity into two sequential steps. The first step is the linearization of the quadratic assignment problem (QAP) in a low-dimensional space, given by the embedding trick. The second step is the evaluation of an information-theoretic distributional measure, which relies on deformable manifold alignment. The proposed measure is a normalized conditional entropy, which induces a positive definite kernel when symmetrized. We use bypass entropy estimation methods to compute an approximation of the normalized conditional entropy. Our approach, which is purely topological (i.e., it does not rely on node or edge attributes although it can potentially accommodate them as additional sources of information) is competitive with state-of-the-art graph matching algorithms as sources of correspondence-based graph similarity, but its complexity is linear instead of cubic (although the complexity of the similarity measure is quadratic). We also determine that the best embedding strategy for graph similarity is provided by commute time embedding, and we conjecture that this is related to its inversibility property, since the inverse of the embeddings obtained using our method can be used as a generative sampler of graph structure.
    The work of the first and third authors was supported by the projects TIN2012-32839 and TIN2015-69077-P of the Spanish Government. The work of the second author was supported by a Royal Society Wolfson Research Merit Award.

    Seeking Affinity Structure: Strategies for Improving m-best Graph Matching

    State-of-the-art methods for finding the m-best solutions to graph matching (QAP) rely on exclusion strategies. The k-th best solution is found by excluding all better ones from the search space. This provides diversity, a natural requirement for transforming a MAP problem into an m-best one. Since diversity enforces mode hopping, it is usually combined with a mode-approximation strategy such as marginalisation. However, these methods are generic insofar as they do not incorporate the detailed structure of the problem at hand, i.e. the properties of the global affinity matrix which characterise the search space. Without this knowledge, it is hard to devise a practical criterion for choosing the next variable to clamp. In this paper, we propose several strategies to select the next variable to clamp, spanning the whole range between depth-first and breadth-first search, and we contribute a unifying view for characterising the search space on the fly. Our strategies are: a) the number of factors in which the variables participate, b) centrality measures associated with the affinity matrix, and c) discrete pooling. Our experiments show that max number of factors and centrality provide a trade-off between efficiency and accuracy, whereas discrete pooling leads to an improvement of the state-of-the-art.

    G-Protein Coupled Receptor Signaling Architecture of Mammalian Immune Cells

    A series of recent studies on large-scale networks of signaling and metabolic systems revealed that a certain network structure, often called a “bow-tie network”, is observed. In signaling systems, a bow-tie network takes a form with diverse and redundant inputs and outputs connected via a small number of core molecules. While arguments have been made that such network architecture enhances robustness and evolvability of biological systems, its functional role at a cellular level remains obscure. A hypothesis was proposed that such a network functions as a stimuli-reaction classifier where the dynamics of core molecules dictate downstream transcriptional activities, and hence physiological responses to stimuli. In this study, we examined whether this hypothesis can be verified using experimental data from the Alliance for Cellular Signaling (AfCS), which comprehensively measured GPCR-related ligand responses for B-cells and macrophages. In a GPCR signaling system, cAMP and Ca2+ act as core molecules. Stimuli-responses for 32 ligands to B-cells and 23 ligands to macrophages have been measured. We found that ligands with correlated changes of cAMP and Ca2+ tend to cluster closely together within the hyperspaces of both cell types, and they induced genes involved in the same cellular processes. It was found that ligands inducing cAMP synthesis activate genes involved in cell growth and proliferation; cAMP and Ca2+ molecules that increased together form a feedback loop and induce immune cells to migrate and adhere together. In contrast, ligands without a core-molecule response are scattered throughout the hyperspace and do not share clusters. G-protein coupled receptors together with immune-response-specific receptors were found in cAMP- and Ca2+-activated clusters. Analyses were performed with original software applicable for discovering ‘bow-tie’ network architectures within the complex network of intracellular signaling, in which ab initio clustering has been implemented as well. Groups of potential transcription factors for each specific group of genes were found to be partly conserved across B-cells and macrophages. These findings support the hypothesis.

    Information retrieval and text mining technologies for chemistry

    Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
    A.V. and M.K. acknowledge funding from the European Community’s Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript.

    GA4GH: International policies and standards for data sharing across genomic research and healthcare.

    The Global Alliance for Genomics and Health (GA4GH) aims to accelerate biomedical advances by enabling the responsible sharing of clinical and genomic data through both harmonized data aggregation and federated approaches. The decreasing cost of genomic sequencing (along with other genome-wide molecular assays) and increasing evidence of its clinical utility will soon drive the generation of sequence data from tens of millions of humans, with increasing levels of diversity. In this perspective, we present the GA4GH strategies for addressing the major challenges of this data revolution. We describe the GA4GH organization, which is fueled by the development efforts of eight Work Streams and informed by the needs of 24 Driver Projects and other key stakeholders. We present the GA4GH suite of secure, interoperable technical standards and policy frameworks and review the current status of standards, their relevance to key domains of research and clinical care, and future plans of GA4GH. Broad international participation in building, adopting, and deploying GA4GH standards and frameworks will catalyze an unprecedented effort in data sharing that will be critical to advancing genomic medicine and ensuring that all populations can access its benefits

    Even-odd carbon atom disparity

    This article does not have an abstract